STA4173 Lecture 1, Summer 2023
In this lecture, we will review summary statistics
Continuous variables
Categorical variables
In this course, we will review formulas, but we will use R for computational purposes
Remember to refer to the lecture notes for specific code needed
Code is also available on this course’s GitHub repository
We can use base R for some things, but for others we will use the tidyverse and janitor packages.
If we need to install packages, we use the install.packages() function.
To load an installed package, we use the library() function.
Definition: mean
\bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
Definition: variance
s^2 = \frac{\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}}{n-1}
Definition: standard deviation
s = \sqrt{s^2}
[1] 107.1111
[1] 10.34945
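As a minimal sketch, the mean, variance, and standard deviation formulas can be checked by hand in R and compared against the built-in mean(), var(), and sd() functions. The vector x below is made-up data for illustration only; it is not the data behind the output shown above.

```r
# Hypothetical data, chosen only to illustrate the formulas
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: sum of the observations divided by n
x_bar <- sum(x) / n

# Variance, using the computational (shortcut) formula
s2 <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

# Compare the hand calculations with the built-in functions
c(by_hand = x_bar, built_in = mean(x))
c(by_hand = s2, built_in = var(x))
c(by_hand = s, built_in = sd(x))
```

In practice we would call mean(), var(), and sd() directly; the hand calculation is only there to connect the formulas to the output.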
Definition: median
The value that lies in the middle of the data when arranged in ascending order.
If n is odd, then the median is literally the middle number.
If n is even, then the median is the average of the two middle numbers.
R syntax:
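As a minimal sketch, base R's median() function handles both the odd and even cases described above. The two vectors are made-up values for illustration.

```r
# Hypothetical data with an odd number of observations (n = 5)
odd_x <- c(3, 1, 7, 5, 9)
sort(odd_x)                  # arrange in ascending order: 1 3 5 7 9
median_odd <- median(odd_x)  # the middle (3rd) value

# Hypothetical data with an even number of observations (n = 4)
even_x <- c(3, 1, 7, 5)
sort(even_x)                   # 1 3 5 7
median_even <- median(even_x)  # average of the two middle values, 3 and 5

median_odd
median_even
```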
Definition: kth percentile, Pk
Definition: quartiles
Definition: five number summary
   0%   25%   50%   75%  100%
62.00 71.75 79.50 87.00 94.00
Definition: interquartile range
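As a rough illustration of percentiles, quartiles, the five number summary, and the interquartile range, the sketch below uses base R's quantile() and IQR() functions. The scores vector is hypothetical and is not the data behind the output shown above. Note that quantile() uses R's default interpolation method, which can differ slightly from hand calculations done with other textbook rules.

```r
# Hypothetical exam scores, made up for illustration
scores <- c(55, 60, 65, 70, 75, 80, 85, 90, 95)

# Five number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(scores)
five_num

# A single percentile, e.g., the 90th percentile (P90)
p90 <- quantile(scores, probs = 0.90)
p90

# Interquartile range: Q3 - Q1
iqr <- IQR(scores)
iqr
```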
When we are dealing with categorical data, we summarize using frequency tables.
e.g., from the UWF Fact Book, in Fall 2021, there were
R syntax:
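As a minimal sketch, a one-way frequency table can be built in base R with table() and prop.table(). The standing vector below is made up for illustration and does not come from the UWF Fact Book. The janitor package's tabyl() function produces a similar table of counts and percentages.

```r
# Hypothetical class standings, made up for illustration
standing <- c("Freshman", "Senior", "Junior", "Senior", "Sophomore",
              "Senior", "Junior", "Freshman", "Senior", "Junior")

# Counts in each category
counts <- table(standing)

# Proportion in each category
props <- prop.table(counts)

# Combine counts and percentages into one frequency table
freq <- data.frame(
  standing = names(counts),
  n        = as.vector(counts),
  percent  = round(100 * as.vector(props), 1)
)
print(freq)
```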
Consider the Motor Trends car road tests data, built into R.
The data was extracted from the 1974 Motor Trend magazine, and includes aspects of car design and performance for 32 cars (1973-74 models).
Let’s find the frequency tables for the number of cylinders and transmission type together.
First, with column percentages,
Next, with row percentages,
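One way to sketch the two-way tables described above is with base R's table() and prop.table(); this is an illustration using base R and not necessarily the approach used in the lecture code. In prop.table(), margin = 2 gives column percentages and margin = 1 gives row percentages. In mtcars, am = 0 is automatic and am = 1 is manual.

```r
# Two-way table of counts: cylinders (rows) by transmission type (columns)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
print(tab)

# Column percentages: within each transmission type,
# the percentage of cars with 4, 6, or 8 cylinders
col_pct <- round(100 * prop.table(tab, margin = 2), 1)
print(col_pct)

# Row percentages: within each cylinder count,
# the percentage of cars that are automatic or manual
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
print(row_pct)
```

Column percentages add up to 100 down each column, while row percentages add up to 100 across each row, so the choice depends on which comparison we want to make.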
When presenting results to others, sometimes it is helpful to create a visualization.
Continuous data:
Categorical data:
Related to analyses:
We can also use color to incorporate other variables
We will use the ggplot2 package for most of our graphing needs.
ggplot2 is part of the tidyverse package.
A good reference book is the official ggplot2: Elegant Graphics for Data Analysis text.
I will often google keywords + ggplot2 and look for examples that provide code.
e.g., “histogram ggplot2” led me to this website
e.g., “change color of dot ggplot2” led me to this website
Sometimes I have to look at several links before I find what I am looking for.
We use the ggplot() function to specify our underlying canvas.
We use the tidyverse pipe operator (%>%) to pipe data into the ggplot() function.
We specify aesthetics using aes() inside of ggplot().
We will add elements to our graph using geom_ functions.
geom_line() creates a line
geom_point() creates a scatterplot
geom_bar() creates a bar chart
geom_text() puts text on the graph
See the full list of geom_ functions on the tidyverse website
The order that you add them matters!
geom_line() + geom_point() = points on top of line
geom_point() + geom_line() = line on top of points
We can also customize every aspect of our graphs.
e.g., the default background is gray, but I personally do not like it, so I typically use theme_minimal() or theme_bw() to give a white background
e.g., we can increase the font size to make things readable
e.g., we can specify colors for: markers (dots/points), outline of a bar chart or histogram, filling of a bar chart or histogram, lines, text, etc.
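The pieces described above can be put together in a short sketch, assuming the tidyverse package is installed. The choice of variables (wt and mpg from mtcars), colors, and font size is arbitrary and only for illustration.

```r
library(tidyverse)

p <- mtcars %>%                              # pipe the data into ggplot()
  ggplot(aes(x = wt, y = mpg)) +             # canvas and aesthetics
  geom_line(color = "gray60") +              # drawn first, so it sits underneath
  geom_point(color = "#003865", size = 3) +  # drawn second, so points sit on top
  labs(x = "Weight (1000 lbs)",
       y = "Miles per Gallon") +
  theme_minimal(base_size = 16)              # white background, larger font

p
```

Swapping the geom_line() and geom_point() lines would draw the line on top of the points instead.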
There are additional functions within other (non-tidyverse) packages that will help us with customization.
We can put graphs together using the ggarrange() function in the ggpubr package
We can use geom_emoji() from the emoGG package to display emojis in graphs :)
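As a small sketch of combining graphs, the example below places two plots side by side with ggarrange(), assuming the tidyverse and ggpubr packages are installed. The two plots themselves are arbitrary examples built from mtcars.

```r
library(tidyverse)
library(ggpubr)

# First graph: histogram of gas mileage
p1 <- mtcars %>%
  ggplot(aes(x = mpg)) +
  geom_histogram(bins = 10, color = "black", fill = "#8DC8E8") +
  labs(x = "Miles per Gallon", y = "Count") +
  theme_bw()

# Second graph: bar chart of the number of cylinders
p2 <- mtcars %>%
  ggplot(aes(x = as.factor(cyl))) +
  geom_bar(color = "black", fill = "#003865") +
  labs(x = "Number of Cylinders", y = "Count") +
  theme_bw()

# Arrange the two graphs in one row with two columns
combined <- ggarrange(p1, p2, ncol = 2, nrow = 1)
combined
```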
I do not expect you to become an expert in data visualization
As with other R code, I will provide basic code during lecture
I do encourage curiosity and exploring further
R is a very, very powerful tool for graphing!
Even before I was An Official R Programmer©, I used ggplot2 to construct graphs.
Other programs are just not great. :(
Today we will look at graphs that go along with summary statistics, but we will learn other ways to graph data as we progress through the semester.
library(tidyverse)

# Mean gas mileage for each cylinder/transmission combination
means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg)) %>%
  ungroup()

# Plot the group means, colored by transmission type
means %>%
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) +
  geom_point(size = 5) +
  labs(x = "Number of Cylinders",
       y = "Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw()

# Same summary, now also keeping the standard deviation for error bars
means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg),
            sd = sd(mpg)) %>%
  ungroup()

# Plot the group means with error bars of one standard deviation
means %>%
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.15) +
  labs(x = "Number of Cylinders",
       y = "Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw()

Today we have reviewed how to describe data.
There is not a one-size-fits-all graph!
In lab, we will learn how to create a table of descriptives.
Next class, we will review statistical inference.